Update

Last Update: 12.05.2024

Updated the majority of the plots to use ggplot2 and its extensions rather than base R, reorganized the headers and subheaders, changed the HTML theme to spacelab, and added Section 4

Set-Up

# reproducibility
set.seed(5110345)

# no scientific notation
options(scipen = 999)

(1) Introduction

1.1 Background on the English Premier League

The English Premier League (EPL) is widely regarded as the world’s most prominent professional men’s soccer league. Founded in 1992, when top English clubs broke away from the Football League First Division, the EPL has become a global phenomenon. Renowned for its lucrative TV broadcasting deals and its showcase of world-class talent, the EPL has attracted massive investment from both club owners and commercial interests in recent years. A notable example is the Abu Dhabi-backed ownership of Manchester City, which has transformed the club into a dominant force in the league. This influx of investment has driven a dramatic rise in player valuations, with clubs willing to spend record-breaking sums to secure top talent. These deals happen in the transfer market, where players are bought and sold between clubs.

The transfer market is open during specific windows: the summer months of June, July, and August, along with a mid-season window in January. During these periods, clubs negotiate transfer fees which represent the agreed amount for one club to acquire a player from another. Transfer fees are influenced by factors such as player performance, contract length, and market demand.

As a result of high levels of investment, the EPL’s transfer market has become one of the most competitive and volatile in the world. Player values can fluctuate significantly based on a combination of on-field form, media narratives, and external market forces, such as the interests of wealthy investors or the potential for future success in international tournaments. Clubs’ heavy investment in the market has not only intensified the level of competition but also increased the financial stakes, making player valuations one of the most dynamic aspects of the league.

1.2 Dataset

The dataset used for this project was compiled from publicly available football statistics websites.

The dataset contains data on 241 EPL attackers and midfielders, along with their performance metrics for the 2023-2024 EPL season. Additionally, the dataset contains the transfer market valuation of each player (the target variable). Here are the variables included in the dataset:

Player Information

  • Player: Player name

  • Nation: Nationality of the player

  • Pos: Position most commonly played by the player (Categorical, 1 if forward and 0 if midfielder)

  • Squad: Squad of the player

  • Age: Age at the start of the season

  • Born: Year of birth

Playing Time

  • MP: Number of matches played (out of a possible 38)

  • Starts: Number of matches started (out of a possible 38)

  • Min: Total minutes played

  • 90s: Total minutes played divided by 90

Performance

  • Gls: Total goals scored

  • Ast: Total assists provided

  • G+A: Sum of Gls and Ast

  • G-PK: Non-penalty goals

  • PK: Total penalty goals scored

  • PKatt: Total penalty kicks attempted

Discipline

  • CrdY: Total yellow cards

  • CrdR: Total red cards

Expected Performance

These metrics are calculated using historical data and probabilistic models to assess the quality of a player’s chances and contributions, providing a more accurate representation of a player’s underlying performance.

For example, a shot with an xG of 0.43 means the player had a 43% chance of scoring from that shot, given its position and context.

All of these are provided by Opta, one of the official statistical partners of the EPL.

  • xG: Expected goals scored

  • npxG: Non-penalty expected goals scored

  • xAG: Expected assists provided

  • npxG+xAG: Sum of npxG and xAG
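As a toy illustration of how these metrics aggregate (the shot probabilities below are made up, not Opta data), a player's season xG total is simply the sum of the per-shot scoring probabilities:

```r
# hypothetical per-shot scoring probabilities for one player
shot_probs = c(0.43, 0.07, 0.12, 0.31)

# season xG is the sum of the individual shot probabilities
sum(shot_probs)  # 0.93
```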

Progression

  • PrgC: A progressive carry occurs when a player moves the ball forward at least 10 yards or into the final third of the pitch while dribbling, measuring how effectively a player can carry the ball and advance play

  • PrgP: A progressive pass is a pass that moves the ball at least 10 yards forward or into the attacking third, reflecting a player’s ability to make attacking passes that drive the play forward

  • PrgR: Tracks how often a player receives a pass that advances the ball by at least 10 yards or into the attacking third, showing how involved a player is in progressing the play when receiving the ball

Target Variable

  • Trnsfr: Transfer market value of a player, expressed in millions of pounds (£)

This dataset is plausibly independent and identically distributed (i.i.d.): it includes only players who participated in the 2023-2024 Premier League season, and each player’s performance is assumed to be unaffected by the performance of the others.

1.3 Objective

The goal of this analysis is to understand the relationship between player performance metrics and transfer market value in the English Premier League. As mentioned, transfer fees are often influenced by external factors such as contract duration and social media presence. This analysis narrows the focus to mainly on-field performance metrics such as goals scored, assists provided, and progressive carries. By using statistical modeling techniques, the project will explore how various metrics contribute to a player’s valuation. Furthermore, challenges such as high collinearity, variable selection, and assumption validation will be addressed to ensure a robust and interpretable model.

(2) Methodology

2.1 Libraries and Load

library(readxl)   # loading dataset
library(dplyr)    # subsetting
library(ggplot2)  # plotting
library(reshape2) # plotting
library(GGally)   # plotting
library(cowplot)  # plotting
library(knitr)    # tables
library(psych)    # summary statistics
library(faraway)  # VIF
library(lmtest)   # BP Test
library(sandwich) # robust statistics
# load dataset
soccer = read_xlsx("epl.dataset.xlsx")

2.2 Feature Engineering

# big six vector
big_six = c("Arsenal", 
            "Chelsea", 
            "Liverpool", 
            "Manchester City", 
            "Manchester Utd", 
            "Tottenham")

# create `six` col
soccer$Six = ifelse(soccer$Squad %in% big_six, 1, 0)

Explanation

We create a variable named Six, an indicator for whether a player plays for a “Big Six” club. The “Big Six” are the six most successful and affluent clubs in EPL history. They are:

  • Arsenal F.C. (Football Club)

  • Chelsea F.C.

  • Liverpool F.C.

  • Manchester City

  • Manchester United

  • Tottenham Hotspur F.C.

We do this because Big Six players tend to have above-average transfer market values, so club membership may itself influence a player’s valuation.

# coerce `Nation` col into factor
soccer$Nation = as.factor(soccer$Nation)

# all of the factor levels
levels(soccer$Nation)
##  [1] "al ALB"  "ar ARG"  "at AUT"  "be BEL"  "bf BFA"  "br BRA"  "cd COD" 
##  [8] "ch SUI"  "ci CIV"  "cl CHI"  "cm CMR"  "co COL"  "cz CZE"  "de GER" 
## [15] "dk DEN"  "dz ALG"  "ec ECU"  "eg EGY"  "eng ENG" "es ESP"  "fr FRA" 
## [22] "ga GAB"  "gd GRN"  "gh GHA"  "gw GNB"  "hr CRO"  "hu HUN"  "ie IRL" 
## [29] "is ISL"  "it ITA"  "jm JAM"  "jp JPN"  "kr KOR"  "ma MAR"  "ml MLI" 
## [36] "mx MEX"  "ng NGA"  "nir NIR" "nl NED"  "no NOR"  "nz NZL"  "pl POL" 
## [43] "pt POR"  "py PAR"  "rs SRB"  "sct SCO" "se SWE"  "sn SEN"  "tn TUN" 
## [50] "tr TUR"  "ua UKR"  "uy URU"  "wls WAL" "za RSA"  "zw ZIM"

Explanation

We coerce the Nation variable into a factor, with each level of the factor representing a country.

Note that while factor variables typically have fewer levels to maintain model interpretability, the country of origin could be a significant factor influencing transfer market value.
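To see why many factor levels can hurt interpretability, note that a factor expands into one dummy column per non-reference level when a model is fit. A small sketch (the factor f below is a toy example, not the real Nation column):

```r
# a toy 3-level factor: model.matrix() expands it into an intercept
# plus one dummy column per non-reference level (2 here)
f = factor(c("ENG", "FRA", "BRA", "ENG"))
dim(model.matrix(~ f))  # 4 rows, 3 columns
```

With 55 nationality levels, the same expansion adds 54 dummy columns to the design matrix.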

2.3 Exploratory Data Analysis

The goal of the exploratory data analysis (EDA) is to gain insights into the relationships between the response variable and some of the most linearly correlated variables. This will be accomplished using visualizations and summary statistics.

2.3.1 Correlation Plot

# select numeric cols
numeric_data = soccer %>% select_if(is.numeric)
# calculate correlation matrix
cor_matrix = cor(numeric_data)

# reshape cor_matrix into long format
cor_long = melt(cor_matrix)

# create corrplot using ggplot2
ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white", size = 0.5) +  # borders to tiles
  scale_fill_gradient2(low = "dodgerblue", 
                       high = "darkred", 
                       mid = "white", 
                       midpoint = 0, 
                       name = "Correlation") +
  geom_text(aes(label = sprintf("%.2f", value)), 
            color = "black", size = 2) +  # add correlation vals
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  # rotates x-axis labels
        axis.text.y = element_text(hjust = 1),              # aligns y-axis labels
        panel.grid = element_blank(),                       # removes grid lines
        axis.title = element_blank())

Explanation

  • numeric_data selects the numeric cols in the soccer dataset

  • cor_matrix calculates a correlation matrix for the cols in numeric_data

  • cor_long reshapes the matrix into a long format using the function melt()

  • ggplot2 is used to create the correlation plot

2.3.2 Scatterplots

# define the response variable Trnsfr
response_variable = "Trnsfr"

# generate list of variables 
predictors = setdiff(colnames(numeric_data), response_variable)

# loop
for (predictor in predictors) {
  
  # create scatterplot
  plot(numeric_data[[predictor]], numeric_data[[response_variable]], 
       main = paste("Scatterplot: ", predictor, "vs.", response_variable),
       xlab = predictor, ylab = response_variable,
       pch = 20, col = "dodgerblue")
  
  # grid-lines
  grid()
}

Explanation

  • response_variable contains the string for the response variable

  • predictors is a list of variables generated by using setdiff(), which excludes the response_variable from selection

  • The loop iterates through each of the predictor variables in predictors and uses plot() to plot scatterplots of each variable against the response

2.3.3 Boxplots and Barplots

# boxplot of `Pos` on `Trnsfr`
ggplot(soccer, aes(x = factor(Pos), y = Trnsfr)) +
  geom_boxplot(fill = "dodgerblue", color = "orange", width = 0.25) +
  labs(title = "Boxplot: Pos on Trnsfr", x = "Pos (1 if forward, 0 else)", y = "Transfer Market Value") +
  theme_minimal()

# barplot of `Pos`
ggplot(soccer, aes(x = factor(Pos))) +
  geom_bar(fill = "dodgerblue", color = "orange", width = 0.25) +
  labs(title = "Barplot: Pos", x = "Pos (1 if forward, 0 else)", y = "Count of Players") +
  theme_minimal()

# boxplot of `Six` on `Trnsfr`
ggplot(soccer, aes(x = factor(Six), y = Trnsfr)) +
  geom_boxplot(fill = "dodgerblue", color = "orange", width = 0.25) +
  labs(title = "Boxplot: Six on Trnsfr", x = "Six (1 if plays for big six, 0 else)", y = "Transfer Market Value") +
  theme_minimal()

# barplot of `Six`
ggplot(soccer, aes(x = factor(Six))) +
  geom_bar(fill = "dodgerblue", color = "orange", width = 0.25) +
  labs(title = "Barplot: Six", x = "Six (1 if plays for big six, 0 else)", y = "Count of Players") +
  theme_minimal()

# barplot of `Nation`
ggplot(soccer, aes(x = factor(Nation))) +
  geom_bar(fill = "dodgerblue", color = "orange") +
  labs(title = "Barplot: Nation", x = "Nation (Factor)", y = "Count of Players") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# barplot of `Nation` on `Trnsfr`
ggplot(soccer, aes(x = factor(Nation), y = Trnsfr)) +
  geom_bar(stat = "identity", fill = "dodgerblue", color = "orange") +
  labs(title = "Barplot: Transfer Market Value by Nation", x = "Nation (Factor)", y = "Transfer Market Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Explanation

Here we focus on the categorical and factor variables in the dataset

  • Boxplots are generated using the geom_boxplot() function from ggplot2

  • Barplots are generated using the geom_bar() function from ggplot2

  • The last barplot fits Nation with Trnsfr as the y-value of the plot to show the difference in transfer values by nation

2.3.4 Descriptive Statistics

# generate descriptive statistics
statistics = psych::describe(numeric_data)

# drop unnecessary cols
statistics = statistics %>%
  select(-vars, -n, -mad, -skew, -kurtosis, -se)

# round for clarity
statistics = round(statistics, 5)

# col names
colnames(statistics) = c("Mean", 
                         "StDev", 
                         "Median", 
                         "Trimmed Mean", 
                         "Min", 
                         "Max", 
                         "Range")

# neat table
kable(statistics, booktabs = TRUE, align = "r")

Explanation

  • psych is used to generate the descriptive statistics

  • dplyr is used to filter the statistics and drop the unnecessary ones such as skew and kurtosis

  • round() is used for clarity and colnames() renames the columns of the dataframe

  • kable() is used to generate the descriptive statistics table

2.3.5 Distribution of Response Variable

ggplot(numeric_data, aes(x = Trnsfr)) + 
  geom_histogram(binwidth = 10, fill = "dodgerblue", color = "orange", alpha = 1) +
  labs(title = "Histogram: Trnsfr",
       x = "Trnsfr",
       y = "Frequency") +
  theme_minimal()

Explanation

  • ggplot2 is used to generate a simple histogram of the response variable Trnsfr

2.4 Baseline Model Development

More Data Manipulation

# Add the 'Nation' column back
numeric_data = bind_cols(numeric_data, soccer %>% select(Nation))
numeric_data$Nation = as.factor(numeric_data$Nation)

Explanation

  • We add Nation back to our numeric_data dataframe as a factor

2.4.1 Baseline Model

Step-Wise AIC and BIC Selection

# size for BIC 
n = nrow(numeric_data)

# fit full additive model
test = lm(Trnsfr ~ ., data = numeric_data) # . includes all predictors

# backwards AIC Search
aic_back_mod = step(test, direction = "backward", trace = 0)

# backwards BIC Search
bic_back_mod = step(test, direction = "backward", trace = 0, k = log(n))

# bidirectional AIC search
aic_exh_mod = step(test, direction = "both", trace = 0)

# bidirectional BIC search
bic_exh_mod = step(test, direction = "both", trace = 0, k = log(n))

aic_back_mod$call
## lm(formula = Trnsfr ~ Age + MP + Min + `90s` + Gls + Ast + `G-PK` + 
##     PKatt + xG + npxG + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)
bic_back_mod$call
## lm(formula = Trnsfr ~ Age + MP + Ast + `G-PK` + PKatt + PrgP + 
##     Six, data = numeric_data)
aic_exh_mod$call
## lm(formula = Trnsfr ~ Age + MP + Min + `90s` + Gls + Ast + `G-PK` + 
##     PKatt + xG + npxG + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)
bic_exh_mod$call
## lm(formula = Trnsfr ~ Age + MP + Ast + `G-PK` + PKatt + PrgP + 
##     Six, data = numeric_data)

Explanation

  • n is the number of observations in the numeric_data dataframe

  • test fits a model containing all of the variables which will be used for the search procedures

  • aic_back_mod is the backwards AIC step search model

  • bic_back_mod is the backwards BIC step search model

  • aic_exh_mod is the bidirectional ("both") AIC step search model

  • bic_exh_mod is the bidirectional ("both") BIC step search model

  • Note we set trace = 0 for all searches to suppress the verbose step-by-step output

  • Note the BIC searches pass k = log(n), which tells step() to use the BIC penalty rather than the default AIC penalty of k = 2
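The k = log(n) trick can be sanity-checked on R's built-in mtcars data (a sketch on a different dataset, not the soccer data):

```r
# step() minimizes -2*logLik + k*npar, so setting k = log(n)
# turns the selection criterion into BIC
n    = nrow(mtcars)
full = lm(mpg ~ ., data = mtcars)
bic_mod = step(full, direction = "backward", trace = 0, k = log(n))

# R's BIC() agrees with AIC() evaluated at k = log(n) for the chosen model
isTRUE(all.equal(BIC(bic_mod), AIC(bic_mod, k = log(n))))
```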

Resulting Model and VIF

result_mod = lm(formula = Trnsfr ~ Age + MP + Min + `90s` + Gls + Ast + `G-PK` + 
    PKatt + xG + npxG + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)

vif(result_mod)
##           Age            MP           Min         `90s`           Gls 
##      1.094068      4.792636 118351.607948 118324.979371    907.291828 
##           Ast        `G-PK`         PKatt            xG          npxG 
##      4.988636    711.244706   3769.590906  62074.496386  45479.378926 
##    `npxG+xAG`          PrgP          PrgR           Six 
##     36.931231      4.217821      3.713216      1.370008

Explanation

  • result_mod is the model obtained from the searches

  • We use the vif() function from the library faraway to obtain the VIF for result_mod
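For intuition, the VIF of predictor j is 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A small sketch with toy collinear data (the variables below are made up, not the soccer predictors):

```r
# VIF_j = 1 / (1 - R^2_j), where R^2_j is from regressing
# predictor j on the other predictors
set.seed(1)
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 0.1)  # nearly collinear with x1

r2 = summary(lm(x1 ~ x2))$r.squared
manual_vif = 1 / (1 - r2)
manual_vif  # very large, flagging the collinearity
```

The extreme VIFs above (e.g., for Min and 90s) arise exactly this way, since 90s is Min divided by a constant.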

Subsetting and Dropping Using Partial Correlation Coefficients

Initial Subset

# subset for step vars
sub = subset(numeric_data, select = c(Trnsfr,
                                      Age, 
                                      MP, 
                                      Min, 
                                      `90s`, 
                                      Gls,
                                      Ast, 
                                      `G-PK`, 
                                      PKatt, 
                                      xG, 
                                      npxG, 
                                      `npxG+xAG`, 
                                      PrgP,
                                      PrgR, 
                                      Six))

Variables of Interest are Min and 90s

# regress response on all preds except preds of interest
less_min_mod = lm(Trnsfr ~ . - Min, data = sub)
less_ninty_mod = lm(Trnsfr ~ . - `90s`, data = sub)

# regress preds of interest against all preds
min_mod = lm(Min ~ . - Trnsfr, data = sub)
ninty_mod = lm(`90s` ~ . - Trnsfr, data = sub)

# corr of response and min
cor(resid(less_min_mod), resid(min_mod))
## [1] 0.1305785
# corr of response and 90s
cor(resid(less_ninty_mod), resid(ninty_mod))
## [1] -0.1296478

Drop 90s

# drop 90s from the subset
sub = subset(sub, select = -`90s`)

Model Without 90s

# without 90s
model_1 = lm(formula = Trnsfr ~ Age + MP + Min + Gls + Ast + `G-PK` + 
    PKatt + xG + npxG + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)

vif(model_1)
##          Age           MP          Min          Gls          Ast       `G-PK` 
##     1.060381     4.772855     8.163680   907.242010     4.977220   711.236503 
##        PKatt           xG         npxG   `npxG+xAG`         PrgP         PrgR 
##  3767.937623 62044.786587 45458.490578    36.898586     4.202575     3.701442 
##          Six 
##     1.367369

Variables of Interest are xG and npxG

# regress response on all preds except preds of interest
less_xg_mod = lm(Trnsfr ~ . - xG, data = sub)
less_npxg_mod = lm(Trnsfr ~ . - npxG, data = sub)

# regress preds of interest against all preds
xg_mod = lm(xG ~ . - Trnsfr, data = sub)
npxg_mod = lm(npxG ~ . - Trnsfr, data = sub)

# corr of response and xG
cor(resid(less_xg_mod), resid(xg_mod))
## [1] -0.1728598
# corr of response and npxG
cor(resid(less_npxg_mod), resid(npxg_mod))
## [1] 0.1752576

Drop xG

# drop xG from the subset
sub = subset(sub, select = -xG)

Model Without 90s and xG

# without 90s, xG
model_2 = lm(formula = Trnsfr ~ Age + MP + Min + Gls + Ast + `G-PK` + 
    PKatt + npxG + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)

vif(model_2)
##        Age         MP        Min        Gls        Ast     `G-PK`      PKatt 
##   1.060365   4.751707   8.161051 740.612741   4.973988 583.949962  48.550683 
##       npxG `npxG+xAG`       PrgP       PrgR        Six 
##  21.272951  36.880765   4.202216   3.648548   1.367119

Variables of Interest are Gls and G-PK

# regress response on all preds except preds of interest
less_gls_mod = lm(Trnsfr ~ . - Gls, data = sub)
less_gpk_mod = lm(Trnsfr ~ . - `G-PK`, data = sub)

# regress preds of interest against all preds
gls_mod = lm(Gls ~ . - Trnsfr, data = sub)
gpk_mod = lm(`G-PK` ~ . - Trnsfr, data = sub)

# corr of response and Gls
cor(resid(less_gls_mod), resid(gls_mod))
## [1] -0.03068311
# corr of response and G-PK
cor(resid(less_gpk_mod), resid(gpk_mod))
## [1] 0.05594432

Drop Gls

# drop Gls from the subset
sub = subset(sub, select = -Gls)

Model Without 90s, xG, Gls

# without 90s, xG, Gls
model_3 = lm(formula = Trnsfr ~ Age + MP + Min + Ast + `G-PK` + 
    PKatt + npxG + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)

vif(model_3)
##        Age         MP        Min        Ast     `G-PK`      PKatt       npxG 
##   1.060226   4.751618   8.038287   4.810253   5.434749   1.647932  21.103024 
## `npxG+xAG`       PrgP       PrgR        Six 
##  36.415338   4.132682   3.631521   1.366870

Variables of Interest are npxG and npxG+xAG

# regress response on all preds except preds of interest
less_npxg_mod = lm(Trnsfr ~ . - npxG, data = sub)
less_npxg_ag_mod = lm(Trnsfr ~ . - `npxG+xAG`, data = sub)

# regress preds of interest against all preds
npxg_mod = lm(npxG ~ . - Trnsfr, data = sub)
npxG_ag_mod = lm(`npxG+xAG` ~ . - Trnsfr, data = sub)

# corr of response and npxG
cor(resid(less_npxg_mod), resid(npxg_mod))
## [1] 0.1185239
# corr of response and npxG+xAG
cor(resid(less_npxg_ag_mod), resid(npxG_ag_mod))
## [1] -0.1210452

Drop npxG

# drop npxG from the subset
sub = subset(sub, select = -npxG)

Model Without 90s, xG, Gls, npxG

# without 90s, xG, Gls, npxG
model4 = lm(formula = Trnsfr ~ Age + MP + Min + Ast + `G-PK` + 
    PKatt + `npxG+xAG` + PrgP + PrgR + Six, data = numeric_data)

vif(model4)
##        Age         MP        Min        Ast     `G-PK`      PKatt `npxG+xAG` 
##   1.059614   4.699315   8.036784   3.637705   4.590033   1.647739  12.230516 
##       PrgP       PrgR        Six 
##   3.621714   3.119021   1.346061

Variables of Interest are G-PK and npxG+xAG

# regress response on all preds except preds of interest
less_gpk_mod = lm(Trnsfr ~ . - `G-PK`, data = sub)
less_npxg_ag_mod = lm(Trnsfr ~ . - `npxG+xAG`, data = sub)

# regress preds of interest against all preds
gpk_mod = lm(`G-PK` ~ . - Trnsfr, data = sub)
npxG_ag_mod = lm(`npxG+xAG` ~ . - Trnsfr, data = sub)

# corr of response and G-PK
cor(resid(less_gpk_mod), resid(gpk_mod))
## [1] 0.3185485
# corr of response and npxG+xAG
cor(resid(less_npxg_ag_mod), resid(npxG_ag_mod))
## [1] -0.04222297

Drop npxG+xAG

# drop `npxG+xAG` from the subset
sub = subset(sub, select = -`npxG+xAG`)

Model Without 90s, xG, Gls, npxG, npxG+xAG

# without 90s, xG, Gls, npxG, npxG+xAG
model5 = lm(formula = Trnsfr ~ Age + MP + Min + Ast + `G-PK` + 
    PKatt + PrgP + PrgR + Six, data = numeric_data)

vif(model5)
##      Age       MP      Min      Ast   `G-PK`    PKatt     PrgP     PrgR 
## 1.054738 4.697207 7.738169 2.671767 2.468864 1.311474 3.542120 2.586803 
##      Six 
## 1.335559

Variables of Interest are MP and Min

# regress response on all preds except preds of interest
less_mp_mod = lm(Trnsfr ~ . - MP, data = sub)
less_min_mod = lm(Trnsfr ~ . - Min, data = sub)

# regress preds of interest against all preds
mp_mod = lm(MP ~ . - Trnsfr, data = sub)
min_mod = lm(Min ~ . - Trnsfr, data = sub)

# corr of response and MP
cor(resid(less_mp_mod), resid(mp_mod))
## [1] -0.1946723
# corr of response and Min
cor(resid(less_min_mod), resid(min_mod))
## [1] 0.1007477

Drop Min

# drop Min from the subset
sub = subset(sub, select = -Min)

Model Without 90s, xG, Gls, npxG, npxG+xAG, Min

# without 90s, xG, Gls, npxG, npxG+xAG, Min
model6 = lm(formula = Trnsfr ~ Age + MP + Ast + `G-PK` + 
    PKatt + PrgP + PrgR + Six, data = numeric_data)

vif(model6)
##      Age       MP      Ast   `G-PK`    PKatt     PrgP     PrgR      Six 
## 1.043371 2.421840 2.671265 2.189610 1.310128 2.467442 2.570758 1.212519
# rename
baseline_mod = model6

Explanation

Although this code may look overcomplicated, the underlying procedure is simple and repetitive.

The entire process goes like this:

  1. The initial subset contains all of the variables selected from the BIC and AIC search process along with the response variable Trnsfr

  2. We begin eliminating variables using the following procedure:

    • Find two highly correlated variables (e.g., Min and 90s)

    • Find their partial correlations with the response Trnsfr

    • Remove the variable that is less correlated with the response from the subset

    • Fit the model with all variables in the updated subset

    • Check the VIF

    • Subset the dataset again and repeat the process until the VIF is below 5 for every variable in the model

  3. We fit our resulting model and use it as the baseline model for the remainder of the project; this is baseline_mod
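The per-pair residual regressions above can be wrapped in a small helper. partial_cor below is a hypothetical convenience function (its name and interface are assumptions, not part of the original code), shown on toy data:

```r
# hypothetical helper: partial correlation of `var` with `response`,
# controlling for every other column of `data`
partial_cor = function(data, response, var) {
  others = setdiff(names(data), c(response, var))
  rhs    = paste(sprintf("`%s`", others), collapse = " + ")
  r_y = resid(lm(as.formula(sprintf("`%s` ~ %s", response, rhs)), data = data))
  r_x = resid(lm(as.formula(sprintf("`%s` ~ %s", var, rhs)), data = data))
  cor(r_y, r_x)
}

# usage sketch on toy data: y depends on x1 but not on x2
set.seed(2)
toy   = data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y = toy$x1 + rnorm(200, sd = 0.2)
partial_cor(toy, "y", "x1")  # close to 1
partial_cor(toy, "y", "x2")  # close to 0
```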

2.4.2 Assumption Validation for Final Baseline Model

Fitted vs. Residuals Plot for Final Baseline Model

# generate fitted and residual values and store in dataframe
base_assump_df = data.frame(fitted = fitted(baseline_mod),
                            residuals = resid(baseline_mod))

# fitted vs. residuals plot for baseline model
ggplot(base_assump_df, aes(x = fitted, y = residuals)) +
  geom_point(color = "dodgerblue", shape = 16) + 
  geom_hline(yintercept = 0, color = "orange", size = 1) + # horizontal line at 0
  labs(x = "Fitted", y = "Residuals") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Explanation

  • base_assump_df is a dataframe housing the fitted values and the residuals of baseline_mod, which are generated using the fitted() and resid() functions

  • ggplot2 is used to graph the Fitted vs. Residuals plot

  • geom_hline is used to add a horizontal line at 0 for easier interpretation

Breusch-Pagan Test for Final Baseline Model

# breusch-pagan test
baseline_bp_test = bptest(baseline_mod)

# test statistic
baseline_bp_test_stat = baseline_bp_test$statistic

# p-value
baseline_bp_pval = baseline_bp_test$p.value

# print 
print(paste("The Test Statistic is:", baseline_bp_test_stat))
print(paste("The p-value is:", baseline_bp_pval))

Explanation

  • The bptest() function from the lmtest library is used to perform the Breusch-Pagan Test

  • baseline_bp_test_stat and baseline_bp_pval hold the test statistic and p-value

Normal Q-Q Plot: Final Baseline Model

# normal q-q plot
qqnorm(resid(baseline_mod), 
       main = "Normal Q-Q Plot: Final Baseline Model",
       col = "dodgerblue")

# add q-q line
qqline(resid(baseline_mod), col = "orange", lwd = 2)

# grid-lines
grid()

Explanation

  • qqnorm() is used to generate the Normal Q-Q plot

  • qqline() adds the Q-Q line to the plot

Shapiro-Wilk Test for Final Baseline Model

# shapiro-wilk test
baseline_sw_test = shapiro.test(resid(baseline_mod))

# p-value
baseline_sw_test_pval = baseline_sw_test$p.value

# print
print(paste("The p-value is:", baseline_sw_test_pval))

Explanation

  • shapiro.test() is used to perform the Shapiro-Wilk Test

  • baseline_sw_test_pval holds the p-value from the test

Assumption Results for Final Baseline Model

# baseline assumption results dataframe
baseline_assumption_df = data.frame(Result = c("Passed", "Failed", "Failed"))
rownames(baseline_assumption_df) = c("Linearity Assumption", 
                                     "Homoscedasticity Assumption", 
                                     "Normality Assumption")

kable(baseline_assumption_df, align = "r")

Explanation

  • baseline_assumption_df holds the results for the assumption checking in a kable() table

2.5 Improving Model Validity

2.5.1 Unusual Observations

Searching for Influential Players

# add player names back to data
numeric_data = bind_cols(numeric_data, soccer["Player"])

# calculate cook's distances
distances = cooks.distance(baseline_mod)

# identify influential points using heuristic
infl_points = which(distances > 4 / length(distances))

# subset for influential players
infl_players = numeric_data[infl_points, ]

# neat table
kable(infl_players["Player"], 
      booktabs = TRUE, align = "r")

Explanation

  • bind_cols() is used to add the Player variable back to numeric_data

  • Cook’s distances are calculated using the function cooks.distance()

  • which() is used to identify the influential points

  • infl_players subsets numeric_data to obtain the names of the influential players

Removing Influential Players

# indexing for non-influential points only
non_infl_data = numeric_data[-infl_points, ]

# fitting no influential points model
non_infl_model = lm(formula = Trnsfr ~ Age + MP + Ast + `G-PK` + PKatt + PrgP + 
    PrgR + Six, data = non_infl_data)

Explanation

  • non_infl_data is a subset of numeric_data containing only non-influential players

  • non_infl_model is the baseline model fit with only non-influential data points

2.5.2 Assumption Validation for Improved Model

Breusch-Pagan Test for Non-Influential Model

# bp test for model with no influential points
non_infl_model_bp_test = bptest(non_infl_model)

# print
print(paste("The test statistic is:", non_infl_model_bp_test$statistic))
print(paste("The p-value is: ", non_infl_model_bp_test$p.value))

Explanation

  • bptest() function is used to test the constant variance assumption for the non-influential model

Fitted vs. Residuals for Non-Influential Model

# generate fitted and residual values and store in dataframe
imp_assume_df = data.frame(fitted = fitted(non_infl_model),
                            residuals = resid(non_infl_model))

# fitted vs. residuals plot for non-influential model
ggplot(imp_assume_df, aes(x = fitted, y = residuals)) +
  geom_point(color = "dodgerblue", shape = 16) + 
  geom_hline(yintercept = 0, color = "orange", size = 1) + # horizontal line at 0
  labs(x = "Fitted", y = "Residuals", 
       title = "Fitted vs. Residuals: Non-Influential Model") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Explanation

  • imp_assume_df contains the fitted values and residuals for the non-influential model

  • ggplot2 is used to generate the plot

Logarithmic Transformation Model

# fit log model
log_mod = lm(formula = log(Trnsfr) ~ Age + MP + Ast + `G-PK` + PKatt + PrgP + 
    PrgR + Six, data = non_infl_data)

# bp test for log model
log_mod_bp = bptest(log_mod)

# results from bp test
print(paste("The BP-Test test statistic is:", log_mod_bp$statistic))
print(paste("The BP-Test p-value is:", log_mod_bp$p.value))

# sw test for log model
log_mod_sw = shapiro.test(resid(log_mod))

# results from sw test 
print(paste("The Shapiro-Wilk p-value is:", log_mod_sw$p.value))

Explanation

  • log(Trnsfr) takes the logarithm of the response

  • bptest() is used to formally test for constant variances

  • shapiro.test() is used to formally test for normality

Square Root Transformation

# fit sqrt model
sqrt_mod = lm(formula = sqrt(Trnsfr) ~ Age + MP + Ast + `G-PK` + PKatt + PrgP + 
    PrgR + Six, data = non_infl_data)

# bp test for sqrt model
sqrt_mod_bp = bptest(sqrt_mod)

# results from bp test
print(paste("The BP-Test test statistic is:", sqrt_mod_bp$statistic))
print(paste("The BP-Test p-value is:", sqrt_mod_bp$p.value))

# sw test for sqrt model
sqrt_mod_sw = shapiro.test(resid(sqrt_mod))

# results from sw test 
print(paste("The Shapiro-Wilk p-value is:", sqrt_mod_sw$p.value))

Explanation

  • sqrt(Trnsfr) is the square root transformation of the response

  • Again bptest() and shapiro.test() are used to formally check the assumptions

Normal Q-Q Plot for Square Root Model

# normal q-q plot
qqnorm(resid(sqrt_mod), main = "Normal Q-Q Plot: Square Root Model", col = "dodgerblue")

# add q-q line
qqline(resid(sqrt_mod), col = "orange", lwd = 2)

# grid-lines
grid()

Explanation

  • qqnorm() plots the Normal Q-Q plot for square root model

  • qqline() adds the q-q line

Summary Table

sqrt_assumption_df = data.frame(Result = c("Passed", "Passed", "Passed (Uncertain)"))
rownames(sqrt_assumption_df) = c("Linearity Assumption", 
                                     "Homoscedasticity Assumption", 
                                     "Normality Assumption")

kable(sqrt_assumption_df, align = "r")

2.5.3 Robust Test Statistics

Rename Model

# rename model
final_mod = non_infl_model

Computing Robust Statistics

# compute robust standard errors
robust_se = sqrt(diag(vcovHC(final_mod, type = "HC3")))

# compute robust test statistics
robust_t_stats = coef(final_mod) / robust_se

# compute robust p-values
robust_p_values = 2 * pt(-abs(robust_t_stats), df = df.residual(final_mod))

# coefficients
final_coefs = summary(final_mod)$coefficients[, "Estimate"]

# regular test statistics
t_stats = summary(final_mod)$coefficients[, "t value"]

# regular p-values
p_values = summary(final_mod)$coefficients[, "Pr(>|t|)"]

# results dataframe
result_df = data.frame(coef = round(final_coefs, 5),
                       t = round(t_stats, 5),
                       pv = p_values,
                       rob_t = round(robust_t_stats, 5),
                       rob_pv = robust_p_values)
colnames(result_df) = c("Estimated Coefficient",
                        "Test Statistic",
                        "P-Value",
                        "Robust Test Statistic",
                        "Robust P-Value")

# neat table
kable(result_df, booktabs = TRUE, align = "r")

Explanation

  • sqrt(diag(vcovHC(final_mod, type = "HC3"))) calculates the robust standard errors

  • Robust test statistics are computed using coef(final_mod) / robust_se

  • Robust p-values are computed using 2 * pt(-abs(robust_t_stats), df = df.residual(final_mod))

  • final_coefs, t_stats, and p_values contain the estimates, regular test statistics, and the regular p-values from the original regression summary for final_mod

  • result_df is a dataframe containing the results

2.6 Some More Interesting Findings

2.6.1 Insignificant Variable

# fit nested model
final_less_prgr = lm(formula = Trnsfr ~ Age + MP + Ast + `G-PK` + PKatt + PrgP + Six, data = non_infl_data)

# anova F-Test
prgr_test = anova(final_less_prgr, final_mod)

paste("The p-value is:", prgr_test$`Pr(>F)`[2])

Explanation

  • final_less_prgr is the nested model

  • anova() is used to test whether PrgP is needed in the model

2.6.2 R-Squared

print(paste("The R-Squared is:", summary(final_mod)$r.squared))
print(paste("The Adjusted R-Squared is:", summary(final_mod)$adj.r.squared))

Explanation

  • summary(final_mod)$r.squared obtains the R-Squared of the regression

  • summary(final_mod)$adj.r.squared obtains the Adjusted R-Squared of the regression

2.6.3 Significance of the Regression

f = summary(final_mod)$fstatistic
p = pf(q = f[1], df1 = f[2], df2 = f[3], lower.tail = FALSE)

print(paste("The Significance of Regression Test Statistic is:", f[1]))
print(paste("The P-Value is:", p))

Explanation

  • f stores the test statistic and the degrees of freedom

  • p computes the p-value for the test statistic on it’s respective degrees of freedom

(3) Results

3.1 Exploratory Data Analysis Results

3.1.1 Correlation Plot

Feature-Engineered Six

The feature-engineered variable Six exhibits a moderate linear correlation of 0.52 with the target variable Trnsfr, suggesting that Six explains a significant portion of the variation in Trnsfr. This implies that including Six in the model could improve its predictive accuracy.

Progressive Variables (PrgR, PrgP, PrgC)

The progressive metrics show moderate correlations with Trnsfr, with all of the metrics having around 0.50 correlation with Trnsfr. This indicates that progressive actions, such as passes, carries, and receptions, play a significant role in determining a player’s market value. Including these variables could improve the model’s ability to capture the impact of player involvement in advancing play.

Discipline Metrics (CrdY, CrdR)

Although CrdY and CrdR are not strongly correlated with Trnsfr, their inclusion in the model provides a reflection of a player’s discipline and mentality. This insight could be particularly useful in cases where clubs value player temperament as part of their transfer decision-making process.

Multicollinearity

The actual performance metrics, such as Gls, Ast, and others, are highly correlated with their expected counterparts (xG, xAG, etc.). This is expected since the expected metrics are derived from underlying performance data and probabilities that align closely with actual outcomes. While both sets of variables provide valuable insights, their high correlation could lead to multicollinearity. To avoid inflating variance, we’ll need to carefully consider which set of metrics to include in the model.

Born and Age are almost perfectly negatively correlated (-0.99), as they are inverse representations of the same concept. Including both variables would be redundant, so we should retain Age as it offers a more intuitive understanding of player maturity and experience.

Game time metrics such as Min, Starts, 90s, and MP are highly correlated, as expected because they measure different aspects of playing time. Including all of them could lead to multicollinearity issues. Therefore, it would be beneficial to select one or two representative metrics to ensure the variance is not inflated and the model remains interpretable.

3.1.2 Scatterplots

Age vs. Trnsfr

Age has a quadratic pattern on Trnsfr, which aligns with the trend that younger and older players generally have lower market values. This is because young players are in their developmental stage while older players are potentially regressing. Including Age as a quadratic term in the model would capture this relationship, allowing the model to better reflect the influence of age on transfer value.
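A quadratic age term is straightforward to add with I() inside the model formula. As a minimal, self-contained sketch, the snippet below uses simulated ages and values (the real model would use Age and Trnsfr from the analysis data frame):

```r
# simulated stand-in data: the real model would use Age and Trnsfr
# from the analysis data frame
set.seed(5110345)
age = sample(17:35, 200, replace = TRUE)
value = 80 - 0.6 * (age - 27)^2 + rnorm(200, sd = 5)  # peak value around age 27

# I() protects the squaring inside the formula, adding a quadratic age term
quad_mod = lm(value ~ age + I(age^2))

# a negative coefficient on I(age^2) captures the inverted-U relationship
coef(quad_mod)["I(age^2)"]
```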

High Variance Across Performance Metrics

Nearly all performance metrics including actual (e.g., Gls), expected (e.g., xG), and progressive (e.g., PrgC) exhibit signs of high variance. This suggests that logging these variables could be beneficial. By applying a logarithmic transformation, we can compress extreme values, reducing the impact of outliers and stabilizing the variance. This transformation also makes relationships more linear and interpretable, which improves the model’s performance and robustness by ensuring more consistent coefficient estimates.

3.1.3 BoxPlots and Barplots

Pos

Pos indicates whether a player is a midfielder (0) or an attacker (1).

The highest market value belongs to an attacker (Erling Haaland), which is unsurprising given the premium placed on elite goal-scorers in the transfer market.

The median Trnsfr value across positions is relatively similar, indicating that positional differences do not drastically affect median valuations.

However, the spread of Trnsfr values for forwards is notably high, reflecting a wider range of market values for attackers. This is likely driven by varying goal-scoring capabilities and demand.

The distribution of players by position is relatively even, with attackers making up 43% of the sample and midfielders comprising the rest. This balance allows for fair comparison across positions in the model.

Six

Six indicates whether a player plays for a Big Six club (1) or does not play for a Big Six club (0).

The boxplot reveals a significant discrepancy in median Trnsfr values, with Big Six players having substantially higher median market values compared to players from other clubs. This discrepancy is logical, given the stronger financial backing and higher performance expectations at these clubs.

The variation within the Big Six group is high, as reflected by the wider spread in their boxplot. This is partly due to their smaller sample size, comprising only about 30% of the dataset, and the diverse market valuations of players across these top-tier teams.

Nation

Nation is a factor with levels indicating the nationality of a player.

English players are the most common nationality, followed by players from major soccer nations such as Brazil and France. This reflects the Premier League’s strong domestic influence and its international appeal, particularly from historically strong soccer countries.

When summing Trnsfr by nationality, English players have the highest total valuation. This likely reflects the “English premium”, where domestic players often command higher transfer fees due to league regulations, fan appeal, and homegrown status requirements.

3.1.4 Descriptive Statistics

Mean StDev Median Trimmed Mean Min Max Range
Pos 0.43154 0.49632 0.0 0.41451 0.0 1.0 1.0
Age 24.94606 3.82878 25.0 24.80829 17.0 35.0 18.0
Born 1997.68050 3.84513 1998.0 1997.81347 1988.0 2006.0 18.0
MP 26.04149 8.20711 28.0 26.47150 10.0 38.0 28.0
Starts 17.90041 10.64652 17.0 17.80311 0.0 37.0 37.0
Min 1596.23651 874.91197 1495.0 1581.51813 79.0 3325.0 3246.0
90s 17.73361 9.71976 16.6 17.57098 0.9 36.9 36.0
Gls 4.21577 4.73233 3.0 3.37306 0.0 27.0 27.0
Ast 2.66805 2.92079 2.0 2.17617 0.0 13.0 13.0
G+A 6.88382 6.79545 5.0 5.86528 0.0 33.0 33.0
G-PK 3.81743 4.11500 2.0 3.10881 0.0 20.0 20.0
PK 0.39834 1.17218 0.0 0.08808 0.0 9.0 9.0
PKatt 0.44398 1.29018 0.0 0.10363 0.0 9.0 9.0
CrdY 3.80498 3.07668 3.0 3.49741 0.0 13.0 13.0
CrdR 0.10373 0.33170 0.0 0.00000 0.0 2.0 2.0
xG 4.19129 4.39114 2.7 3.40829 0.0 29.2 29.2
npxG 3.84025 3.76456 2.7 3.20466 0.0 22.9 22.9
xAG 2.70000 2.55852 1.9 2.27876 0.0 11.8 11.8
npxG+xAG 6.54315 5.58418 5.2 5.72383 0.1 27.3 27.2
PrgC 41.29046 37.28894 29.0 35.70984 0.0 218.0 218.0
PrgP 72.03734 63.04782 56.0 62.52850 2.0 376.0 374.0
PrgR 90.51867 81.27192 64.0 78.54922 1.0 508.0 507.0
Trnsfr 28.71846 27.25455 20.0 24.50518 0.4 180.0 179.6
Six 0.29876 0.45866 0.0 0.24870 0.0 1.0 1.0

Mean of Age

The average age is 25, aligning with the typical peak of a player’s physical and athletic prime. This age distribution makes sense, as clubs often prioritize players in their prime for both performance and transfer value.

Playing Time

On average, players participate in 26 matches and start 18, indicating a considerable level of rotation within teams. This could reflect squad depth, tactical flexibility, or player fitness management over the season.

Outlier Influence

The trimmed mean for Gls highlights how significantly Erling Haaland skews the average, given his exceptional goal-scoring record. By trimming extreme values, the mean offers a more balanced reflection of typical goal tallies across players, mitigating the influence of outliers like Haaland.

Similarly, expected and progressive metrics are heavily influenced by outliers. Players with exceptionally high values in these categories can distort averages, emphasizing the need for strategies such as logging or trimming to handle variance and maintain model reliability.

Aggression

Players receive an average of roughly four yellow cards per season, although one player received 13, which indicates an extremely aggressive playstyle. This notable outlier suggests a disciplinary concern and could impact a player’s market perception if such tendencies are considered detrimental.

The maximum number of red cards is 2, suggesting that extreme disciplinary actions are relatively rare. However, it is still impactful enough to consider when evaluating a player’s discipline or risk factor.

Big Six Representation

Approximately 29.8% of players belong to a Big Six club. This highlights the Premier League’s concentration of top talent within a select group of elite clubs, which likely contributes to their disproportionate impact on market valuations.

Transfer Market Value

The average transfer value is £28.7M, with a trimmed mean of £24.5M. The maximum value, £180M, once again reflects Haaland’s market dominance. This significant gap suggests that a small number of high-value players could be inflating the overall average.

3.1.5 Distribution of Response Variable

Trnsfr Findings

The histogram of Trnsfr reveals a right-skewed distribution with noticeable outliers on the far right, representing players with exceptionally high transfer market values. These outliers significantly inflate the mean and can introduce skewness into the model. This can affect the accuracy and reliability of the model.

3.1.6 Conclusion and Next Steps for Model Development

Based on the findings from our data analysis, we are ready to proceed with building a series of regression models to understand Trnsfr (transfer market value). Our approach will be systematic, first establishing a baseline model then determining if the regression assumptions hold. If they do not, we consider other alternatives such as removing influential points or using robust test statistics.

Here is the plan:

Step 1: Variable Selection for the Baseline Model

We will first establish a baseline model. To do this, we will:

  1. Review correlations

  2. Address multicollinearity

Step 2: Check Assumptions

Once we have our baseline model, we will validate the regression assumptions to ensure robustness:

The regression assumptions are:

  • Linearity

  • Homoscedasticity (constant variance)

  • Normality of errors (this is really only important for inference, and even then, we should always compute robust test statistics)

Step 3: Improve Model Robustness

Once we determine which tests the model passes and fails, we attempt to make the model more robust.

3.2 Baseline Model Development Results

3.2.1 Search Procedure

To present the baseline model, we will first outline the search procedure.

The search procedure begins by fitting a full additive model:

\[ \text{Trnsfr} = X\beta + \epsilon \]

Where:

  • \(\text{Trnsfr}\) is the vector of responses

  • \(X\) is the design matrix, which contains a column of ones for the intercept and a column for each of the variables

  • \(\beta\) is the vector of coefficients corresponding to the columns of \(X\)

  • \(\epsilon\) is the vector of errors

We then use AIC and BIC backwards and exhaustive searches to select our model. To do this in R, we use the step() function. In order to understand the step() function, we must first understand AIC and BIC.

AIC

The Akaike Information Criterion (AIC) is used for model variable selection by balancing model fit and complexity. The formula is:

\[ AIC = -2 \log L(\hat{\beta}, \hat{\sigma}^2) + 2p = n + n \log(2\pi) + n \log\left(\frac{RSS}{n}\right) + 2p \]

Where:

  • \(L(\hat{\beta}, \hat{\sigma}^2)\) is the likelihood function evaluated at the estimated parameters

  • \(p\) is the number of parameters in the model

  • \(RSS\) is the residual sum of squares

  • \(n\) is the sample size

The penalty term in the AIC is \(2p\): because we are searching for a low AIC, models with many parameters are penalized.

Therefore, a good model (one with low AIC) will balance between fitting well and using a small number of parameters.

BIC

The Bayesian Information Criterion (BIC) is very similar to the AIC except that it has a higher penalty term. The formula is:

\[ BIC = -2 \log L(\hat{\beta}, \hat{\sigma}^2) + \log(n) \cdot p = n + n \log(2\pi) + n \log\left(\frac{RSS}{n}\right) + \log(n) \cdot p \]

The penalty term in this case is \(\log(n) \cdot p\), which is why we use k = log(n) when using BIC search for the step() function.

Similarly, a good model (one with low BIC) will balance between fitting well and using a small number of parameters. However, BIC tends to be stricter and selects fewer parameters than AIC due to the higher penalty term.
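Both closed-form expressions can be checked numerically against R’s built-in AIC() and BIC() functions. A minimal sketch, using a toy regression on the built-in mtcars data:

```r
# verify the closed-form AIC and BIC expressions against R's AIC() and BIC(),
# using a toy regression on the built-in mtcars data
m = lm(mpg ~ wt + hp, data = mtcars)
n = nrow(mtcars)
rss = sum(resid(m)^2)
p = length(coef(m)) + 1  # +1 because sigma^2 also counts as a parameter

aic_manual = n + n * log(2 * pi) + n * log(rss / n) + 2 * p
bic_manual = n + n * log(2 * pi) + n * log(rss / n) + log(n) * p

all.equal(aic_manual, AIC(m))  # TRUE
all.equal(bic_manual, BIC(m))  # TRUE
```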

Backward Procedure

For the backward step-wise procedure, the step() function starts with the full additive model specified above. Then, R iteratively removes one variable at a time, recalculating the AIC at each step. The process continues until removing any further variables no longer improves the AIC. The model with the lowest (best) AIC is returned as the final model.

The same follows for the BIC.
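The backward searches themselves can be sketched as follows, using mtcars as a stand-in for the full additive Trnsfr model (which is fit and searched the same way); setting k = log(n) switches the penalty from AIC to BIC:

```r
# backward searches on the built-in mtcars data, standing in for the
# full additive Trnsfr model
full_mod = lm(mpg ~ ., data = mtcars)
n = nrow(mtcars)

# backward AIC search (k = 2, the default penalty, corresponds to AIC)
aic_mod = step(full_mod, direction = "backward", trace = 0)

# backward BIC search (k = log(n) switches the penalty to BIC)
bic_mod = step(full_mod, direction = "backward", k = log(n), trace = 0)

# the selected model can never have a worse AIC than the starting model
AIC(aic_mod) <= AIC(full_mod)
```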

3.2.2 Model Obtained from AIC and BIC Searches

Interestingly, we obtain several different models from our searches. However, the BIC searches return models that already have low VIF (Variance Inflation Factors), and since we want to demonstrate how to address high collinearity amongst the variables, we settle for the AIC backward and exhaustive search models. Note there is no “correct” model; the choice truly depends on the types of questions one wishes to answer. In this case, we want to address high collinearity amongst variables such as Gls and xG because that makes for a more interesting project! Hence, the chosen model is:

\[ \text{Trnsfr} = \beta_0 + \beta_1\text{Age} + \beta_2\text{MP} + \beta_3\text{Min} + \beta_4\text{90s} + \beta_5\text{Gls} + \beta_6\text{Ast} + \beta_7\text{G-PK} + \beta_8\text{PKatt} + \beta_9\text{xG} + \beta_{10}\text{npxG} + \beta_{11}\text{npxG+xAG} + \beta_{12}\text{PrgP} + \beta_{13}\text{PrgR} + \beta_{14}\text{Six} + \epsilon_i \] Where \(\epsilon_i\) is the error.

3.2.3 Addressing High Collinearity

From our exploratory data analysis (EDA), it is evident that certain predictor variables exhibit high correlations. Specifically:

  • MP, Min, and 90s are all measures of playing time and are naturally correlated

  • G-PK, PKatt, xG, and npxG represent performance metrics and are interrelated

High collinearity does not prevent model estimation (unlike perfect collinearity) but it still affects the model in negative ways:

  • Unreliable Standard Errors: The standard errors of the variables are inflated, making it harder to assess the statistical significance of variables

  • Unstable Coefficients: The estimated coefficients are highly unstable, where small changes in data cause large swings in estimated coefficients

Note that high collinearity does not necessarily impact the predictive accuracy of our model. We can still generate reliable predictions despite the presence of high collinearity. However, prediction is not the primary goal of this analysis. Our objective is to model and interpret the relationships between the variables and the response variable Trnsfr.

When interpretation is the focus, high collinearity becomes problematic because it obscures the individual contribution of each variable. For example, while our model may suggest that playing time and performance are significant drivers of transfer value, we cannot clearly determine the extent to which Min, MP, or 90s independently influence Trnsfr due to their high collinearity.

Thus, reducing high collinearity will improve the interpretability of our coefficients while maintaining the model’s robustness. This ensures that any inferences drawn from the model are both statistically reliable and practically meaningful.

VIF

To identify high collinearity in the model, we will use the Variance Inflation Factor (VIF) for each variable. The formula for the VIF is as follows:

\[ \text{VIF}_i = \frac{1}{1 - R_i^2} \] Where \(R_i^2\) is the coefficient of determination obtained by regressing the \(i\)-th variable on all the other variables in the model. In other words, \(R_i^2\) represents the proportion of variance in the \(i\)-th variable that can be explained by the remaining variables. A high \(R_i^2\) suggests that the \(i\)-th variable is highly collinear with others, which leads to a high VIF.

The VIF can be interpreted (loosely because there is no set rule) as follows:

  • \(VIF_i < 5\) is generally acceptable

  • \(5 \le VIF_i < 10\) is sometimes tolerated, but we will opt for the stricter \(VIF_i < 5\) threshold

  • \(VIF_i > 10\) suggests high collinearity
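The VIF formula can be computed directly in base R by regressing each predictor on all the others. A sketch using mtcars columns as stand-in predictors (the analysis itself would typically use a dedicated VIF helper such as car::vif()):

```r
# VIF_i = 1 / (1 - R_i^2): regress each predictor on all the others
# (mtcars columns stand in for the model's predictors)
X = mtcars[, c("disp", "hp", "wt", "drat")]

vifs = sapply(names(X), function(v) {
  others = setdiff(names(X), v)
  r2 = summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

# equivalently, the VIFs are the diagonal of the inverse correlation matrix
vifs_alt = diag(solve(cor(X)))
```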

VIF for Step-wise Search Model

VIF
Age 1.09407
MP 4.79264
Min 118351.60795
90s 118324.97937
Gls 907.29183
Ast 4.98864
G-PK 711.24471
PKatt 3769.59091
xG 62074.49639
npxG 45479.37893
npxG+xAG 36.93123
PrgP 4.21782
PrgR 3.71322
Six 1.37001

Using the model obtained from the step-wise selection procedure, we calculate the VIF for all variables. We see that Min, 90s, Gls, G-PK, PKatt, xG, npxG and npxG+xAG have exceedingly high VIF values, suggesting extremely strong collinearity with other variables in the model (as expected from our EDA). These variables will be the primary focus for addressing high collinearity.

Iterative Variable Elimination Using Partial Correlation Coefficients

We proceed by iteratively removing variables based on their partial correlation coefficients with the response variable Trnsfr. The process is as follows:

  1. Identify two highly correlated variables with high VIF values

  2. Compute the partial correlations of each variable with Trnsfr while controlling for all other variables

    • The partial correlation is computed as follows:

      • Regress the response Trnsfr on all variables except variable of interest

      • Regress variable of interest on all variables except the response

      • Compute the correlation of the residuals of the models to obtain the partial correlation

  3. Drop the variable with a weaker partial correlation with Trnsfr

  4. Refit the model and recalculate VIF after each drop

  5. Repeat until all VIF values are below 5
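Step 2 of the procedure above can be sketched in base R as follows, illustrated on mtcars with mpg standing in for Trnsfr, wt as the variable of interest, and hp and disp as the controls:

```r
# partial correlation of a response and a variable of interest, controlling
# for the remaining variables (mpg stands in for Trnsfr, wt for the variable
# of interest, hp and disp for the controls)
partial_cor = function(df, response, var, controls) {
  res_y = resid(lm(reformulate(controls, response = response), data = df))
  res_x = resid(lm(reformulate(controls, response = var), data = df))
  cor(res_y, res_x)  # correlation of the two sets of residuals
}

pc = partial_cor(mtcars, "mpg", "wt", c("hp", "disp"))
```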

Iteration 1: Drop 90s

The variables of interest are 90s and Min because they both measure playing time metrics.

We drop 90s because the magnitude of its partial correlation coefficient is less than the magnitude of the partial correlation of Min.

Iteration 2: Drop xG

The variables of interest are xG and npxG because they both measure expected goals scored except npxG does not include penalty goals.

We drop xG because the magnitude of its partial correlation coefficient is less than the magnitude of the partial correlation of npxG.

Iteration 3: Drop Gls

The variables of interest are Gls and G-PK because they both measure goals scored except G-PK does not include penalty goals.

We drop Gls because the magnitude of its partial correlation coefficient is less than the magnitude of the partial correlation of G-PK.

Iteration 4: Drop npxG

The variables of interest are npxG and npxG+xAG because, fairly obviously, npxG is part of npxG+xAG.

We drop npxG because the magnitude of its partial correlation coefficient is less than the magnitude of the partial correlation of npxG+xAG.

Iteration 5: Drop npxG+xAG

The variables of interest are now G-PK and npxG+xAG because these still have high VIF values and G-PK likely directly influences the npxG metric.

We drop npxG+xAG because the magnitude of its partial correlation coefficient is less than the magnitude of the partial correlation of G-PK.

Iteration 6: Drop Min

The results after the fifth iteration show that MP does not appear to have an inflated variance but Min does, and because they are both playing time metrics we compare them.

We drop Min because the magnitude of its partial correlation coefficient is less than the magnitude of the partial correlation of MP.

3.2.4 Final Baseline Model

After dropping 90s, xG, Gls, npxG, npxG+xAG, and Min, the final baseline model is:

\[ \text{Trnsfr} = \beta_0 + \beta_1\text{Age} + \beta_2\text{MP} + \beta_3\text{Ast} + \beta_4\text{G-PK} + \beta_5\text{PKatt} + \beta_6\text{PrgP} + \beta_7\text{PrgR} + \beta_8\text{Six} + \epsilon_i \]

VIF for Final Baseline Model

VIF
Age 1.04337
MP 2.42184
Ast 2.67127
G-PK 2.18961
PKatt 1.31013
PrgP 2.46744
PrgR 2.57076
Six 1.21252

As we can see, the VIF values are now all below our desired threshold so we proceed to assumption validation for the final baseline model.

3.2.5 Assumption Validation for Baseline Model

Validation Methods

We are interested in the following assumptions:

  • Linearity

  • Homoscedasticity (Constant variance)

  • Normality of Errors

Linearity

To check for the linearity assumption, we use a Fitted vs. Residuals plot.

If at any fitted value in the plot, the mean of the residuals is approximately 0, then the linearity assumption holds.

Homoscedasticity

To check for constant variance, we again use the Fitted vs. Residuals plot.

If the spread of the residuals is approximately equal at any fitted value, then the constant variance assumption holds.

We can also check for constant variance using the Breusch-Pagan Test. The null and alternative hypotheses are as follows:

\(H_0\): Homoscedasticity, the errors have constant variance about the true model

\(H_1\): Heteroscedasticity, the errors do not have constant variance about the true model

Hence, if we reject \(H_0\) we conclude that there is heteroscedasticity (non-constant variance) in the model.
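The Breusch-Pagan statistic is simple to compute by hand: regress the squared residuals on the predictors and take n times the auxiliary R-squared (this is the studentized version of the test). A sketch on simulated heteroscedastic data:

```r
# manual sketch of the (studentized) Breusch-Pagan statistic on simulated
# heteroscedastic data: regress squared residuals on the predictors, then n * R^2
set.seed(5110345)
x = runif(300, 1, 10)
y = 2 + 3 * x + rnorm(300, sd = x)  # error spread grows with x

m = lm(y ~ x)
aux = lm(resid(m)^2 ~ x)                 # auxiliary regression
bp_stat = length(y) * summary(aux)$r.squared
bp_pval = pchisq(bp_stat, df = 1, lower.tail = FALSE)  # df = number of predictors
```

Here the simulated errors are strongly heteroscedastic by construction, so the test should reject the null of constant variance.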

Normality of Errors

To check the normality assumption, we use the Normal Quantile-Quantile plot.

If the points on the plot do not follow a straight line (or approximately straight), then the plot suggests the data does not come from a normal distribution.

We can also check for normality using the Shapiro-Wilk Test. The null and alternative hypotheses are as follows:

\(H_0\): Data is sampled from a normal distribution

\(H_1\): Data is not sampled from a normal distribution

Hence, if we fail to reject \(H_0\), we can conclude that the data is likely to come from a normal distribution.

Linearity and Homoscedasticity

From the plot, we observe that the majority of the points are scattered near 0 without any discernible pattern, which suggests that the linearity assumption holds for the baseline model. However, there are some larger residuals that are further from 0 which could indicate the presence of outliers or influential points that may be affecting the model fit. Note that even when the linearity assumption is not valid, OLS still generates the best linear approximation (so either way, we are good to go!).

Importantly, the residuals form a funnel pattern where the spread of residuals increases as the fitted values increase. This pattern suggests that the variance of the residuals is likely non-constant, which indicates potential heteroscedasticity (i.e., violation of the assumption of constant variance). To formally test for heteroscedasticity, we will perform the Breusch-Pagan Test. We obtain the following results:

## [1] "The Test Statistic is: 79.8982341662703"
## [1] "The p-value is: 0.0000000000000512495663338689"

From the Breusch-Pagan Test, we obtain a very high test statistic and a p-value close to 0. As a result, we reject the null hypothesis of homoscedasticity at the common significance levels (1%, 5%, 10%) and conclude that there is heteroscedasticity in the model (i.e., the errors do not have constant variance). This means the assumption of constant variance is violated, and any statistical inference (such as confidence intervals or hypothesis tests) made based on the baseline model may not be valid. Note that this issue can be addressed by using heteroscedastic robust test statistics.

Normality Assumption

The resulting Normal Q-Q plot suggests that the model may not meet the normality assumption. While the points are roughly aligned along the reference line in the middle, there are deviations in the tails, indicating that the residuals may not follow a normal distribution. This could suggest the presence of heavy tails or outliers. To formally assess normality, we conduct the Shapiro-Wilk Test, which will help determine whether the residuals deviate significantly from a normal distribution. Here are the results:

## [1] "The p-value is: 0.0000000806895886292323"

We obtain a p-value that leads to rejection of the null hypothesis at the 1%, 5%, and 10% significance levels. We conclude that the residuals likely do not follow a normal distribution.

Summary

Here are the combined assumption validation results for the baseline model:

Result
Linearity Assumption Passed
Homoscedasticity Assumption Failed
Normality Assumption Failed

We now attempt to improve the validity of the model.

3.3 Improving Model Validity Results

3.3.1 Unusual Observations

We first search for unusual observations in the data.

As we have seen in the EDA, there are some high-profile players who are skewing the distribution of the response. In order to identify whether these players affect the regression or not, we must determine if they are “influential”.

An influential data point is an observation that has both high leverage (i.e., the potential to exert a large influence on the fitted model) and a large residual. To determine whether a point is influential or not, we use Cook’s Distance. The formula is:

\[ D_i = \frac{1}{p} \cdot \frac{r_i^2 \cdot h_i}{1 - h_i} \]

Where \(D_i\) is Cook’s Distance and \(p\) is the number of parameters in the model.

\(r_i\) is the standardized residual for observation \(i\), calculated as:

\[ r_i = \frac{e_i}{s_e \sqrt{1 - h_i}} \]

Where \(e_i\) is the raw residual for observation \(i\), \(s_e \sqrt{1 - h_i}\) is the standard error of the residual, and \(h_i\) is the leverage for observation \(i\).

Finally, the leverage for observation \(i\) is considered large if \(h_i > 2\bar{h}\), where \(\bar{h}\) is the mean leverage.

Note that a Cook’s Distance is considered large if:

\[ D_i > \frac{4}{n} \]
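In R, Cook’s Distance is available via cooks.distance(). A sketch on a toy mtcars model, which also reproduces the formula above from standardized residuals and leverages:

```r
# Cook's Distance on a toy mtcars model, with the 4/n rule of thumb
m = lm(mpg ~ wt + hp, data = mtcars)
n = nrow(mtcars)
p = length(coef(m))

d = cooks.distance(m)

# the same quantity from the formula above, via standardized residuals and leverages
d_manual = rstandard(m)^2 * hatvalues(m) / (p * (1 - hatvalues(m)))

# observations exceeding the 4/n cutoff are flagged as influential
influential = names(d)[d > 4 / n]
```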

Influential Players

Player
Bukayo Saka
Declan Rice
Gabriel Martinelli
Pascal Groß
Christopher Nkunku
Cole Palmer
Nicolas Jackson
João Palhinha
Mohamed Salah
Carlton Morris
Erling Haaland
Phil Foden
Rodri
Bruno Fernandes
Bruno Guimarães
Son Heung-min
Pedro Neto

Using Cook’s Distance, we can now see the influential players in the dataset and, unsurprisingly, the majority of them are players who are considered world-class (or on the cusp of being world-class).

Here are some reasons for why these players may be influential:

  • Saka had an exceptional season, recording 16 goals and 9 assists

  • Rice had a combined 15 goals and assists along with 278 progressive passes (PrgP)

  • Martinelli recorded 345 progressive receptions (PrgR)

  • Groß recorded a notable 10 assists and completed 302 progressive passes

  • Nkunku’s 19 progressive passes and 35 progressive receptions could be making him influential, although he only featured in 11 matches (which may be another reason)

  • Palmer had an exceptional season, recording 22 goals and 11 assists

  • Jackson had a prolific goal scoring record with 14 goals scored

  • Palhinha recorded 97 progressive passes

  • Salah had an exceptional season, scoring 18 goals and assisting 10 times

  • Morris had 15 combined goals and assists, along with 120 progressive receptions

  • Haaland had an exceptional season, scoring 27 goals (won the golden boot)

  • Foden had an exceptional season, scoring 19 goals and assisting 8 times

  • Rodri had an exceptional season, obtaining a total of 17 goals and assists while also completing 376 progressive passes

  • Fernandes had 10 goals and 8 assists

  • Guimarães had 15 total goals and assists while also recording 283 progressive passes

  • Son Heung-min had an exceptional season, scoring 17 goals and assisting 10 times

  • Neto recorded 149 progressive receptions despite playing only 20 games in total, and also obtained 9 assists

Dealing with Influential Players

Given the influential players are anomalies (i.e., a level above the rest of the players) and not representative of the general population of Premier League players, we will exclude these players from the improved model.

This is not necessarily the “correct” decision. However, when we remove the players, we see that the constant variance assumption is satisfied.

3.3.2 Assumption Validation for Improved Model

Constant Variance and Linearity

## [1] "The test statistic is: 11.2591254003293"
## [1] "The p-value is:  0.187444622505481"

Based on the p-value, we fail to reject the null for the Breusch-Pagan Test and conclude that the errors have constant variance.

We can also check the Fitted vs. Residuals plot for the model with non-influential points.

We see an improvement in the fitted vs. residuals plot as there is no longer a clear funnel shape and the variance looks more equal. This further supports the validity of the Breusch-Pagan Test.

So now our model passes two assumptions:

  • Linearity

  • Constant Variance (homoscedasticity)

We will now consider the normality assumption.

Normality Assumption for Improved Model

Normality of the errors is typically pursued by transforming the response variable, using methods such as logarithmic or square root transformations. Here, we try both.

We will check the normality assumption, and also use the Breusch-Pagan Test to determine whether the transformation affected the constant variance assumption.

Logarithmic Transformation of Response

We try a logarithmic transformation first.

## [1] "The BP-Test test statistic is: 19.5715285577208"
## [1] "The BP-Test p-value is: 0.0120845386492388"
## [1] "The Shapiro-Wilk p-value is: 0.00000285680302256805"

We see immediately that a logarithmic transformation does not improve our model’s validity. In fact, using a log transformation causes the constant variance assumption to fail. Furthermore, we reject the null hypothesis for the Shapiro-Wilk Test and conclude again that the data is not normally distributed even with a log transformation.

Square Root Transformation of Response

We try the square root transformation next.
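A hedged sketch of the analogous square-root fit (`mtcars` again stands in for the player data):

```r
# Refit with a square-root-transformed response
sqrt_mod <- lm(sqrt(mpg) ~ wt + hp, data = mtcars)

# Shapiro-Wilk test of the residuals for normality
sw_p <- shapiro.test(resid(sqrt_mod))$p.value
```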

## [1] "The BP-Test test statistic is: 13.1135981012104"
## [1] "The BP-Test p-value is: 0.108000371066453"
## [1] "The Shapiro-Wilk p-value is : 0.0453737582842567"

We fail to reject the Breusch-Pagan null at the 1%, 5%, and 10% significance levels, indicating that the square root model satisfies the homoscedasticity assumption.

For the Shapiro-Wilk Test, we obtain a p-value of about 0.0454. The results are as follows:

  • Fail to reject at the 1% significance level

  • Reject at the 5% and 10% significance levels

Since we fail to reject the null hypothesis at the stricter 1% level, there is no significant deviation from normality at that threshold. However, the rejections at the 5% and 10% levels indicate that, at less strict significance levels, there is some evidence that the residuals may deviate from normality (although this could be a Type I error).

We can investigate further by checking the Normal Q-Q plot.
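Since the document's plots use ggplot2, a Normal Q-Q plot of the residuals can be sketched as follows (`mtcars` and `sqrt(mpg)` stand in for the player data and square-root response):

```r
library(ggplot2)

# Q-Q plot of residuals from a square-root-response stand-in model
sqrt_mod <- lm(sqrt(mpg) ~ wt + hp, data = mtcars)
qq_df    <- data.frame(res = resid(sqrt_mod))

p <- ggplot(qq_df, aes(sample = res)) +
  stat_qq() +                     # sample vs. theoretical normal quantiles
  stat_qq_line() +                # reference line through the quartiles
  labs(title = "Normal Q-Q Plot",
       x = "Theoretical Quantiles", y = "Sample Quantiles")
```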

Note that after applying the square root transformation, the deviations in the tails of the Q-Q plot are less pronounced, providing further evidence that the residuals more closely approximate a normal distribution. This improvement suggests that the transformation has mitigated the skewness, reinforcing the normality assumption for the regression analysis.

However, due to the uncertainty raised by the Shapiro-Wilk Test, we proceed by computing both robust test statistics and standard test statistics to ensure a more reliable and comprehensive assessment of the model’s validity.

Summary

Here are the combined assumption validation results for the improved model (the square root model).

Assumption         Result
Linearity          Passed
Homoscedasticity   Passed
Normality          Passed (Uncertain)

We continue by computing robust test statistics.

Also note that by using a square root transformation, the interpretation of the model changes drastically.

For example, consider the model: \(\sqrt{Y} = a + bX\), so that \(Y = (a + bX)^2\)

Then the change in \(Y\) with respect to \(X\) is: \(\frac{dY}{dX} = 2b(a + bX)\)

Although the square root transformation helped resolve the normality problem, the interpretation is challenging even for this simple model: the marginal effect of \(X\) now depends on \(X\) itself. For this reason, we proceed by using the baseline model without influential points and computing test statistics for it.

3.3.3 Robust Test Statistics

Robust test statistics rely on robust standard errors, which remain valid even if the normality assumption is violated. These standard errors adjust for heteroscedasticity and enhance the reliability of hypothesis tests, particularly when residuals are not homoscedastic.

Violations of normality mainly affect inference in smaller samples, where they can lead to unreliable p-values. With large samples, however, the Central Limit Theorem ensures that the sampling distribution of the estimates tends to normality, reducing the impact of non-normal residuals. With 241 observations and 8 independent variables, our sample is sufficiently large for reliable inference. Hence, even with non-normal residuals, robust standard errors provide reliable results.

The model we will be using is non_infl_mod which is the baseline model with only non-influential players. This model fails the normality assumption but satisfies the linearity and constant variance assumptions. Therefore, it is a great candidate to compute robust statistics for!
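As a hedged base-R sketch of how such robust (heteroscedasticity-consistent, HC1-style) statistics can be computed; in practice this is usually done with `sandwich::vcovHC` and `lmtest::coeftest`, and `mtcars` stands in for the player data here:

```r
# HC1 "sandwich" robust standard errors computed by hand
mod <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(mod)
e <- resid(mod)
n <- nrow(X); k <- ncol(X)

# (X'X)^-1 [X' diag(e^2) X] (X'X)^-1, with the HC1 n/(n-k) correction
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)          # equals X' diag(e^2) X
V     <- (n / (n - k)) * bread %*% meat %*% bread

robust_se <- sqrt(diag(V))
robust_t  <- coef(mod) / robust_se
robust_p  <- 2 * pt(-abs(robust_t), df = n - k)
```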

Non-Influential (Final) Model Results

Variable Estimated Coefficient Test Statistic P-Value Robust Test Statistic Robust P-Value
(Intercept) 45.66398 8.12527 0.0000000 8.27192 0.0000000
Age -1.45567 -7.27040 0.0000000 -7.25703 0.0000000
MP -0.44280 -2.95427 0.0034831 -2.88619 0.0042970
Ast 1.44378 3.39012 0.0008311 3.11182 0.0021117
G-PK 1.74733 5.74892 0.0000000 5.64104 0.0000001
PKatt 3.57386 3.45699 0.0006583 2.83546 0.0050127
PrgP 0.15234 6.88059 0.0000000 6.19323 0.0000000
PrgR 0.02676 1.58630 0.1141418 1.30105 0.1946325
Six 15.81705 8.61450 0.0000000 7.60311 0.0000000

We observe that the regular and robust test statistics are very similar, with both sets of p-values consistently indicating the significance of most variables. This alignment suggests that despite the minor concerns regarding the normality assumption, the results from the robust statistics confirm the validity of the model’s findings. In particular, the variables Age, MP, Ast, G-PK, and Six remain significant, even when accounting for potential deviations from normality. The fact that both regular and robust statistics produce similar results indicates that the model is likely robust to normality violations, reinforcing the strength and reliability of the conclusions.

(4) Some More Interesting Findings

4.1 Non-Significant Variable

From our results, we see that the variable PrgR is not significant at the 1%, 5%, or 10% significance levels.

We can use an F-Test to determine whether to retain the variable or not.
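A sketch of such a partial F-test via nested-model comparison (`mtcars` variables stand in for the player covariates; `qsec` plays the role of the variable under scrutiny):

```r
# Partial F-test: compare the full model against a reduced model
# that drops the single predictor in question
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

ftest <- anova(reduced, full)
print(paste("The p-value is:", ftest[2, "Pr(>F)"]))
```

When only one variable is dropped, this F-test is equivalent to the coefficient's t-test (F = t²), so the two p-values coincide.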

## [1] "The p-value is: 0.114141797383035"

The p-value indicates that PrgR is likely not needed in the regression. (Note that for a single variable, this partial F-test p-value matches the t-test p-value of 0.1141 reported in the table above.)

4.2 R-Squared

## [1] "The R-Squared is: 0.711275295300586"
## [1] "The Adjusted R-Squared is: 0.700532050474562"

We obtain an R-Squared of about 0.71 and an Adjusted R-Squared of about 0.70, meaning roughly 70-71% of the variation in the response Trnsfr is explained by the regression.
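These quantities can be pulled directly from the model summary; a minimal sketch (`mtcars` stands in for the player data):

```r
# Extract R-squared and adjusted R-squared from the fitted model
mod <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(mod)

print(paste("The R-Squared is:", s$r.squared))
print(paste("The Adjusted R-Squared is:", s$adj.r.squared))
```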

4.3 Significance of the Regression

## [1] "The Significance of Regression Test Statistic is: 66.2067472927337"
## [1] "The P-Value is: 0.00000000000000000000000000000000000000000000000000000799973061319306"

Since the p-value is extremely close to 0, we reject the null hypothesis that the regression is insignificant. We conclude that the regression is significant (i.e., at least one of the variables in the regression has a significant linear relationship with the response).
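The overall F-statistic and its p-value can be recovered from the model summary; a minimal sketch (`mtcars` stands in for the player data):

```r
# Significance-of-regression (overall) F-test from the model summary
mod <- lm(mpg ~ wt + hp, data = mtcars)
f   <- summary(mod)$fstatistic   # named vector: value, numdf, dendf
p   <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

print(paste("The Significance of Regression Test Statistic is:", f["value"]))
print(paste("The P-Value is:", p))
```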